Backend Python Tutorial#
Welcome to the backend tutorial on how to use DQMaRC programmatically.
We begin by importing the necessary libraries, including pandas and the DataQuality class. In this case, we will also import some functions from the UtilitiesDQMaRC module, including MetricCalculator, BarPlotGenerator, and DonutChartGenerator. These are supplementary functions for quick visualisation of the data quality markup report. The overall_quality_fx() function calculates the overall quality.
Source Data#
Import Libraries#
import pandas as pd
from pkg_resources import resource_filename
# Import DQMaRC
from DQMaRC import DataQuality
# Import UtilitiesDQMaRC for data quality visualisation
from DQMaRC.UtilitiesDQMaRC import (
MetricCalculator,
BarPlotGenerator,
DonutChartGenerator,
overall_quality_fx, col_good, col_bad)
DQMaRC: the main package to assess multiple data quality dimensions for a given dataset.
UtilitiesDQMaRC: a suite of modules to support the initial visualisation of the data quality markup report produced by DQMaRC. These utilities are also used to generate the shiny frontend interface.
- MetricCalculator: calculates sum totals and averages of data quality errors for each dimension.
- BarPlotGenerator: accepts the summary results produced by MetricCalculator to generate a plotly barplot for quick visualisation of data quality averages.
- DonutChartGenerator: accepts the same summary results to generate plotly donut charts of the average data quality per dimension.
- overall_quality_fx: a function to generate an overall average of all data quality metrics and fields.
- col_good/col_bad: colour encodings for good and poor data quality percentages.
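For orientation, these utilities are typically chained as follows (a preview sketch only; the full, runnable version appears in the Data Quality Visualisation section below):
# calculator = MetricCalculator(raw_results_subset)  # subset of DataQuality.raw_results()
# calculator.calculate_metrics()
# DonutChartGenerator(calculator.result).plot_donut_charts()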
Data Import#

We will use a synthetic dataset generated by Mockaroo and read it in as a pandas DataFrame. This data is included in the package but is purely synthetic.
# Read in example data as pandas dataframe
#df = pd.read_csv('../DQMaRC/data/toydf_subset.csv')
df = pd.read_csv(resource_filename('DQMaRC', 'data/toydf_subset.csv'))
df
Patient_ID | Gender | Date_of_Diagnosis | Date_of_Birth | Age | Weight_kg | Height_cm | BMI | Postcode | ICD_10_Code | Tumour_Stage | Metastatic_Indicator | Tumour_M_Stage | Regimen_Start_Date | Datetime_Event1 | Datetime_Logging1 | Datetime_Event2 | Datetime_Logging2 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Male | 23/04/2033 | 07/06/2090 | 72.0 | NaN | 220.50 | 0.0 | BS5 0NQ | Y743 | Incomplete Stage | Present | M0 | 04/10/2025 | 13/03/2024 00:00 | 13/03/2024 00:07 | 16/03/2024 00:00 | 16/03/2024 00:53 |
1 | 1 | Male | NaN | 07/06/2090 | 181.0 | 882.01 | 184.53 | 68.0 | BT78 3PN | Y743 | Incomplete Stage2 | Present | M0 | 04/10/2025 | 25/03/2024 00:00 | 25/03/2024 00:02 | 19/03/2024 00:00 | 19/03/2024 01:11 |
2 | 2 | . | 06/02/2020 | NaN | 140.0 | 414.60 | 170.05 | 67.0 | UB7 0JP | Unkown | Incomplete Stage2 | Present | M0 | 24/08/2021 | 21/03/2024 00:00 | 21/03/2024 00:01 | 04/03/2024 00:00 | 04/03/2024 00:18 |
3 | 3 | Male | 30/07/2021 | 24/08/2003 | 151.0 | 703.49 | 30.99 | 39.0 | MK9 3NZ | | Incomplete Stage2 | Present | M0 | NaN | 12/03/2024 00:00 | 12/03/2024 00:08 | 30/03/2024 00:00 | 30/03/2024 01:28 |
4 | 4 | Male | 17/09/2010 | 13/07/1999 | NaN | 804.32 | 7.48 | 55.0 | B63 3QX | | Incomplete Stage2 | Present | M0 | 24/07/2015 | 20/03/2024 00:00 | 20/03/2024 00:07 | 29/03/2024 00:00 | 29/03/2024 00:47 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
104 | 104 | Female | 26/11/2010 | 01/02/1979 | 122.0 | 34.63 | 159.07 | 98.0 | YO10 3JQ | O3691X1 | Stage 2 | Present | M0 | 10/12/2018 | 1991/03/2024 00:00:00 | 23/03/2024 00:00 | 13/03/2024 00:00 | 01-01-1122 01:25:00 |
105 | 105 | Female | 30/07/2011 | 08/10/1987 | 163.0 | 598.80 | NaN | 7.0 | ME15 6TB | T23129D | Stage3 | Present | M0 | 12/09/2015 | 1991/03/2024 00:00:00 | 23/03/2024 00:00 | 13/03/2024 00:00 | 01-01-1122 01:25:00 |
106 | 106 | Female | 15/11/2014 | 17/10/1925 | 81.0 | 53.69 | 226.65 | 54.0 | L5 2QA | M778 | Stage 4 | Present | M0 | 26/03/2017 | 1991/03/2024 00:00:00 | 23/03/2024 00:00 | 13/03/2024 00:00 | 01-01-1122 01:25:00 |
107 | 107 | Female | 15/03/2018 | 25/10/1974 | 45.0 | 788.70 | 105.50 | 16.0 | N15 4GE | D160 | Stage 3 | Present | M0 | 10/06/2015 | 1991/03/2024 00:00:00 | 23/03/2024 00:00 | 13/03/2024 00:00 | 01-01-1122 01:25:00 |
108 | 108 | Male | 13/10/2017 | 02/06/1948 | 145.0 | 147.52 | 107.68 | 18.0 | KY5 0NE | S66117D | Stage3 | Present | M0 | 01/09/2019 | 1991/03/2024 00:00:00 | 23/03/2024 00:00 | 13/03/2024 00:00 | 01-01-1122 01:25:00 |
109 rows × 18 columns
Test Parameters Setup#
Initialise DQMaRC#
DQMaRC must be initialised by applying the DataQuality class to your source dataset. DQMaRC then requires configuration of the test_params attribute. This is a dataset that DQMaRC uses to map which test parameters (or data quality tests) should be applied to the relevant fields or variables of the source dataset. Each test parameter is functionally supported by a respective method provided by DQMaRC. For example, the Completeness_NULL test parameter is calculated by the test_null() method.
If this is your first time using DQMaRC, you can automatically generate a test_params template using the get_param_template() method. This template is based on the source dataset and automatically activates the Completeness and Uniqueness parameters for all source data fields. In other words, users can run DQMaRC using the auto-generated template to see what the outputs look like and determine whether this is the right tool for the job.
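For example (a sketch; we assume here that get_param_template() is callable on the initialised DataQuality object, mirroring the description above):
# dq = DataQuality(df)
# template = dq.get_param_template()  # assumed method; auto-generates a template from df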
We strongly encourage users to get familiar with the test parameters and to take the time to customise them. All test parameters are listed and explained in a table below. We recommend customising them in Excel due to ease of use, but Python users also have the option to customise test parameters programmatically. We demonstrate how to do this here.
# Initialise a DQ object by passing your data to the tool
dq = DataQuality(df)
# Retrieve the default test parameters from the object. We will edit these in the next step
test_params = dq.get_test_params()
# View the test parameters template
test_params
Warning - Using default parameters. For best results edit parameters and reload using set_test_params() method
Field | Date_Format | Completeness_NULL | Completeness_Empty | Completeness_Encoded | Completeness_Encoded_Mapping | Uniqueness_Rows | Consistency_Compare | Consistency_Compare_Field | Consistency_Compare_Mapping | ... | Validity_NHS_Number | Validity_Postcode_UK | Validity_Lookup_Table | Validity_Lookup_Type | Validity_Lookup_Codes | Validity_Range | Validity_Range_Numeric | Validity_Pattern | Validity_Pattern_Regex | Accuracy_Gold_Standard | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Patient_ID | True | True | False | NaN | True | False | NaN | NaN | ... | False | False | False | NaN | NaN | False | NaN | False | NaN | False | |
1 | Gender | True | True | False | NaN | True | False | NaN | NaN | ... | False | False | False | NaN | NaN | False | NaN | False | NaN | False | |
2 | Date_of_Diagnosis | True | True | False | NaN | True | False | NaN | NaN | ... | False | False | False | NaN | NaN | False | NaN | False | NaN | False | |
3 | Date_of_Birth | True | True | False | NaN | True | False | NaN | NaN | ... | False | False | False | NaN | NaN | False | NaN | False | NaN | False | |
4 | Age | True | True | False | NaN | True | False | NaN | NaN | ... | False | False | False | NaN | NaN | False | NaN | False | NaN | False | |
5 | Weight_kg | True | True | False | NaN | True | False | NaN | NaN | ... | False | False | False | NaN | NaN | False | NaN | False | NaN | False | |
6 | Height_cm | True | True | False | NaN | True | False | NaN | NaN | ... | False | False | False | NaN | NaN | False | NaN | False | NaN | False | |
7 | BMI | True | True | False | NaN | True | False | NaN | NaN | ... | False | False | False | NaN | NaN | False | NaN | False | NaN | False | |
8 | Postcode | True | True | False | NaN | True | False | NaN | NaN | ... | False | False | False | NaN | NaN | False | NaN | False | NaN | False | |
9 | ICD_10_Code | True | True | False | NaN | True | False | NaN | NaN | ... | False | False | False | NaN | NaN | False | NaN | False | NaN | False | |
10 | Tumour_Stage | True | True | False | NaN | True | False | NaN | NaN | ... | False | False | False | NaN | NaN | False | NaN | False | NaN | False | |
11 | Metastatic_Indicator | True | True | False | NaN | True | False | NaN | NaN | ... | False | False | False | NaN | NaN | False | NaN | False | NaN | False | |
12 | Tumour_M_Stage | True | True | False | NaN | True | False | NaN | NaN | ... | False | False | False | NaN | NaN | False | NaN | False | NaN | False | |
13 | Regimen_Start_Date | True | True | False | NaN | True | False | NaN | NaN | ... | False | False | False | NaN | NaN | False | NaN | False | NaN | False | |
14 | Datetime_Event1 | True | True | False | NaN | True | False | NaN | NaN | ... | False | False | False | NaN | NaN | False | NaN | False | NaN | False | |
15 | Datetime_Logging1 | True | True | False | NaN | True | False | NaN | NaN | ... | False | False | False | NaN | NaN | False | NaN | False | NaN | False | |
16 | Datetime_Event2 | True | True | False | NaN | True | False | NaN | NaN | ... | False | False | False | NaN | NaN | False | NaN | False | NaN | False | |
17 | Datetime_Logging2 | True | True | False | NaN | True | False | NaN | NaN | ... | False | False | False | NaN | NaN | False | NaN | False | NaN | False |
18 rows × 30 columns
Test Parameters Configurations#
Now we will demonstrate how to edit the test parameters programmatically. You can also export the test parameters template as a csv file and edit it in MS Excel.
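For example, a minimal export/reimport round trip might look like this (the file name is arbitrary):
# Export the template for editing in Excel, then read the edited copy back in
test_params.to_csv('my_test_params.csv', index=False)
# ... edit in Excel ...
test_params = pd.read_csv('my_test_params.csv')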
Datetime Format#
The first test parameter to edit is Date_Format. Although this is not strictly speaking a data quality metric, it does help Python to correctly identify and calculate other data quality metrics for datetime fields.
# Datetime format
test_params.loc[test_params['Field']=='Date_of_Diagnosis', 'Date_Format'] = "%d/%m/%Y"
test_params.loc[test_params['Field']=='Date_of_Birth', 'Date_Format'] = "%d/%m/%Y"
test_params.loc[test_params['Field']=='Datetime_Event1', 'Date_Format'] = "%d/%m/%Y %H:%M"
test_params.loc[test_params['Field']=='Datetime_Logging1', 'Date_Format'] = "%d/%m/%Y %H:%M"
# Another way to do this, using the row indices shown in the template above:
# test_params.at[2,'Date_Format']="%d/%m/%Y" # Date_of_Diagnosis
# test_params.at[3,'Date_Format']="%d/%m/%Y" # Date_of_Birth
# test_params.at[14,'Date_Format']="%d/%m/%Y %H:%M" # Datetime_Event1
# test_params.at[15,'Date_Format']="%d/%m/%Y %H:%M" # Datetime_Logging1
Completeness Parameters#
The first data quality dimension we will edit is Completeness. When you initialise the test parameters template, DQMaRC assumes that the user may wish to calculate the number of NULL or empty records for all fields. As such, the default values for Completeness_NULL (see the test_null() method) and Completeness_Empty (see the test_empty() method) are TRUE.
We encourage the user to think of or look for other values that may signify missing data. For example, values such as "Unknown", "Not Known", or "na" may indicate data missingness. To capture these, we set the test parameter Completeness_Encoded to TRUE (see the test_na_strings() method) for the field in question (in this case, "Gender"). Then we include the relevant missing codes in Completeness_Encoded_Mapping. Each possible value or string encoding for missing data is separated by a pipe, e.g. 'Unknown|Not Known|na' (i.e. Unknown or Not Known or na).
# Completeness
# Set to true for all fields
test_params['Completeness_NULL'] = True # Default value
test_params['Completeness_Empty'] = True # Default value
# Set Completeness Encoding only for the "Gender" column
test_params.loc[test_params['Field']=='Gender', 'Completeness_Encoded'] = True
test_params.loc[test_params['Field']=='Gender', 'Completeness_Encoded_Mapping'] = 'Unknown|Not Known|na' # use pipes for multiple possible codes for missingness
Uniqueness Parameters#
Uniqueness evaluates whether records are unique across the fields that are set to TRUE. In most cases, we assume that records should be unique across at least all fields combined. Therefore, DQMaRC by default sets the parameter Uniqueness_Rows to TRUE for all fields. However, if records are expected to be unique across select fields only, we encourage the user to set TRUE for just those fields and FALSE for the remaining fields, as sketched after the code below.
# Uniqueness
# Set to true for all fields
test_params['Uniqueness_Rows'] = True # Default value
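As a sketch, if records were expected to be unique on Patient_ID alone (a hypothetical choice for this dataset):
# test_params['Uniqueness_Rows'] = False
# test_params.loc[test_params['Field']=='Patient_ID', 'Uniqueness_Rows'] = True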
Consistency Parameters#
Consistency evaluates whether data between two fields in the same dataset agree as expected. It consists of two key metrics: Consistency_Compare (see the test_one_to_one() method), which compares values between two fields, and Consistency_Date_Relations (see the date_relationships() method), which checks the logical relationship between two datetime fields.
We demonstrate the first consistency metric by comparing Metastatic_Indicator with Tumour_M_Stage. In theory, both variables indicate whether cancer has spread to distant locations, but in the real world they may come from different sources. The values in both fields are recorded differently, but mean similar things. For example, if Metastatic_Indicator == Absent, then we expect Tumour_M_Stage == M0, meaning no metastasis is present. If Metastatic_Indicator == Present, then we expect Tumour_M_Stage to be M1, M1a, or other values, but here we abbreviate the list to keep things simple.
For datetime fields, we may expect one date to occur before or after another in the same dataset. We demonstrate this using Date_of_Birth, which obviously should occur before Date_of_Diagnosis. Importantly, we tell DQMaRC to raise an error if it detects that Date_of_Birth > Date_of_Diagnosis.
# Consistency
# Set up consistency checks between the "Metastatic_Indicator" and "Tumour_M_Stage" columns
test_params.loc[test_params['Field']=='Metastatic_Indicator', 'Consistency_Compare'] = True
test_params.loc[test_params['Field']=='Metastatic_Indicator', 'Consistency_Compare_Field'] = 'Tumour_M_Stage'
test_params.loc[test_params['Field']=='Metastatic_Indicator', 'Consistency_Compare_Mapping'] = '{"Absent": ["M0"], "Present": ["M1", "M1a"]}'
# Set up consistency checks between date fields "Date_of_Birth" and "Date_of_Diagnosis"
test_params.loc[test_params['Field']=='Date_of_Birth', 'Consistency_Date_Relations'] = True
test_params.loc[test_params['Field']=='Date_of_Birth', 'Consistency_Date_Relations_Field'] = 'Date_of_Diagnosis'
test_params.loc[test_params['Field']=='Date_of_Birth', 'Consistency_Date_Relationship'] = '>' # i.e. raise an error if Date of Birth > Date of Diagnosis
Timeliness Parameters#
Timeliness describes how fresh or up-to-date data is. One way to measure this is to define a threshold on the time difference (see the date_diff_calc() method) between fields that record the timing of events and when those events were captured in the system. In this case, the Datetime_Event1 field signifies the time at which a user says an event occurred, or when that data was observed in the real world, whereas Datetime_Logging1 is the electronically generated datetime log indicating when the user entered this data into the system. In a clinical or health setting, the freshness of data may be crucial. For example, in critical care, patient observation data may be recorded several minutes or hours after the time of observation, instead of immediately. To flag these errors, we tell DQMaRC to raise an error if the time difference between Datetime_Event1 and Datetime_Logging1 is greater than 10 minutes.
# Timeliness
# Raise an error if a time difference threshold of 10 minutes is exceeded
test_params.loc[test_params['Field']=='Datetime_Event1', 'Timeliness_Date_Diff'] = True
test_params.loc[test_params['Field']=='Datetime_Event1', 'Timeliness_Date_Diff_Field'] = 'Datetime_Logging1'
test_params.loc[test_params['Field']=='Datetime_Event1', 'Timeliness_Date_Diff_Threshold'] = '10' # i.e. raise an error if timediff >10 minutes
Validity Parameters#
Data Validity describes data that conform to the expected standards, patterns, or ranges that reflect the intended real-world objects. DQMaRC has 7 validity metrics, shown below:
Future Dates: flags records containing a date that occurs in the future (see the Validity_Dates_Future() method).
# (1) Future Dates
test_params.loc[test_params['Field']=='Date_of_Diagnosis', 'Validity_Dates_Future'] = True
test_params.loc[test_params['Field']=='Date_of_Birth', 'Validity_Dates_Future'] = True
Date Outliers: checks if date records fall outside expected date ranges (see the min_max_dates() method).
# (2) Date Outliers
test_params.loc[test_params['Field']=='Date_of_Diagnosis', 'Validity_Date_Range'] = True
test_params.loc[test_params['Field']=='Date_of_Birth', 'Validity_Date_Range'] = True
test_params.loc[test_params['Field']=='Date_of_Diagnosis', 'Validity_Date_Range_Min'] = '2011-01-01'
test_params.loc[test_params['Field']=='Date_of_Birth', 'Validity_Date_Range_Min'] = '1900-01-01'
test_params.loc[test_params['Field']=='Date_of_Diagnosis', 'Validity_Date_Range_Max'] = '2023-07-07'
test_params.loc[test_params['Field']=='Date_of_Birth', 'Validity_Date_Range_Max'] = '2023-01-01'
Numerical Outliers: checks if numerical data fall outside expected numerical ranges (see the test_ranges() method). Note that the permissible range is supplied as a pipe-separated 'min|max' string in the Validity_Range_Numeric column (see the test parameters definitions table below).
# (3) Numerical Outliers
test_params.loc[test_params['Field']=='Age', 'Validity_Range'] = True
test_params.loc[test_params['Field']=='Age', 'Validity_Range_Numeric'] = '0|120'
test_params.loc[test_params['Field']=='Height_cm', 'Validity_Range'] = True
test_params.loc[test_params['Field']=='Height_cm', 'Validity_Range_Numeric'] = '20|210'
NHS Number Validator: checks if NHS numbers are valid (see the validate_nhs_number() method).
# (4) NHS Number Validator
# test_params.loc[test_params['Field']=='NHS_Number', 'Validity_NHS_Number'] = True
UK Postcode Validator: checks if UK postcodes are valid (see the test_postcode() method).
# (5) UK Postcode Validator
test_params.loc[test_params['Field']=='Postcode', 'Validity_Postcode_UK'] = True
Data Standards: checks if values conform to expected permissible values as defined by a data standard (see the test_against_lookup_tables() method). Either create and save your own data standard:
# (6a) Data Standards
test_params.loc[test_params['Field']=='Tumour_M_Stage', 'Validity_Lookup_Table'] = True
test_params.loc[test_params['Field']=='Tumour_M_Stage', 'Validity_Lookup_Type'] = 'Values'
test_params.loc[test_params['Field']=='Tumour_M_Stage', 'Validity_Lookup_Codes'] = 'M0|M1|M1b|pM1'
Or access a pre-defined list saved as a csv file:
# (6b) Use an external csv data standard list of valid codes
lu_filename = '../DQMaRC/data/lookups/LU_toydf_gender.csv'
# Here we apply a pre-defined data standard for "Gender"
test_params.loc[test_params['Field']=='Gender', 'Validity_Lookup_Table'] = True
test_params.loc[test_params['Field']=='Gender', 'Validity_Lookup_Type'] = 'File'
test_params.loc[test_params['Field']=='Gender', 'Validity_Lookup_Codes'] = lu_filename
Regular Expression Pattern: checks if data conform to an expected pattern defined by a regular expression (see the test_pattern_validity() method). A quick way to sanity-check a regex is sketched after this list.
# (7) Regular Expression Pattern
test_params.loc[test_params['Field']=='Datetime_Event1', 'Validity_Pattern'] = True
test_params.loc[test_params['Field']=='Datetime_Event1', 'Validity_Pattern_Regex'] = r"(\d{2})/(\d{2})/(\d{4}) (\d{2}):(\d{2})"
test_params.loc[test_params['Field']=='Datetime_Event2', 'Validity_Pattern'] = True
test_params.loc[test_params['Field']=='Datetime_Event2', 'Validity_Pattern_Regex'] = r"[0-9]{2}/[0-9]{2}/[0-9]{4} [0-9]{2}:[0-9]{2}"
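To sanity-check a regular expression before adding it to the parameters, you can exercise it with Python's re module (a sketch; we use fullmatch here, and DQMaRC's own matching semantics may differ):
import re
pattern = re.compile(r"(\d{2})/(\d{2})/(\d{4}) (\d{2}):(\d{2})")
print(bool(pattern.fullmatch("13/03/2024 00:07")))       # True - expected format
print(bool(pattern.fullmatch("1991/03/2024 00:00:00")))  # False - malformed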
Accuracy Parameters#
Accuracy measures the consistency of input data against a known gold standard. In this case, we will set the gold standard to the input data itself to demonstrate the functionality.
# Accuracy
# Set a manually validated version of the data set as the gold standard
test_params['Accuracy_Gold_Standard'] = True
# Supply the gold standard data - we are using the same dataset here for ease
dq.accuracy.set_gold_standard(df)
Test Parameters Definitions#
Here is a list of all test parameters and their respective descriptions and permissible configurations.
# Read in the test parameter definitions as a pandas dataframe
test_params_definitions = pd.read_csv(resource_filename('DQMaRC', 'data/test_params_definitions.csv'))
test_params_definitions
Test Parameter | Python_Data_Type | Example_Data | Description | |
---|---|---|---|---|
0 | Field | String | Metastatic_Indicator | Name of the fields of the source input data. |
1 | Date_Format | String | %d/%m/%Y %H:%M | The datetime format pattern specified by ISO s... |
2 | Completeness_NULL | Boolean | TRUE or FALSE | Count of "NULL" values. |
3 | Completeness_Empty | Boolean | TRUE or FALSE | Count of empty records. |
4 | Completeness_Encoded | Boolean | TRUE or FALSE | TRUE/FALSE indicator if field uses codes to in... |
5 | Completeness_Encoded_Mapping | String | Missing|N/A|Empty | A list of values or codes that represent missi... |
6 | Uniqueness_Rows | Boolean | TRUE or FALSE | TRUE/FALSE indicator if field is expected to b... |
7 | Consistency_Compare | Boolean | TRUE or FALSE | TRUE/FALSE to indicate if field should be comp... |
8 | Consistency_Compare_Field | String | Tumour_M_Stage | Enter the name of the field to compare against... |
9 | Consistency_Compare_Mapping | Pandas dictionary | {"Present": ["M1", "M1a"]} | Map the permissible correct comparison values ... |
10 | Consistency_Date_Relations | Boolean | TRUE or FALSE | TRUE/FALSE if a date field should be compared ... |
11 | Consistency_Date_Relationship | String | <, >, <=, >= | Relationship of the comparison (<, >, <=, or >... |
12 | Consistency_Date_Relations_Field | String | Date_of_Birth | Name of the compared datetime field if "Consis... |
13 | Timeliness_Date_Diff | Boolean | TRUE or FALSE | Indicator to perform a calculation between two... |
14 | Timeliness_Date_Diff_Field | String | Datetime_Logging1 | If "Timeliness_Date_Diff" = "TRUE", here you s... |
15 | Timeliness_Date_Diff_Threshold | Integer | 10 | What is the permissible threshold (in minutes)... |
16 | Validity_Dates_Future | Boolean | TRUE or FALSE | TRUE/FALSE indicator to raise an error if a fu... |
17 | Validity_Date_Range | Boolean | TRUE or FALSE | TRUE/FALSE if a date range is applicable. |
18 | Validity_Date_Range_Min | String | 01/01/1900 | If "Validity_Dates_Future" = "TRUE", what is t... |
19 | Validity_Date_Range_Max | String | 01/01/2025 | If "Validity_Dates_Future" = "TRUE", what is t... |
20 | Validity_NHS_Number | Boolean | TRUE or FALSE | TRUE/FALSE if the NHS validity algorithm check... |
21 | Validity_Postcode_UK | Boolean | TRUE or FALSE | TRUE/FALSE to raise an error if invalid UK pos... |
22 | Validity_Lookup_Table | Boolean | TRUE or FALSE | TRUE/FALSE if a data standard is applicable. |
23 | Validity_Lookup_Table_Filename | String | LU_toydf_ICD10_v5.csv | If "Validity_Lookup_Table"="TRUE", what is the... |
24 | Validity_Range | Boolean | TRUE or FALSE | TRUE/FALSE if a numerical range applicable, e.... |
25 | Validity_Range_Numeric | String | 0|110 | If "Validity_Range"="TRUE", supply the numeric... |
26 | Validity_Pattern | Boolean | TRUE or FALSE | TRUE/FALSE if a valid pattern is applicable. |
27 | Validity_Pattern_Regex | String | [0-9]{2}/[0-9]{2}/[0-9]{2} | If "Validity_Pattern"="TRUE", supply the permi... |
28 | Accuracy_Gold_Standard | Boolean | TRUE or FALSE | TRUE/FALSE if a gold standard reference datase... |
Upload Custom Test Parameters#
We have shown how you can programmatically edit the test parameters dataframe. However, users may opt to make these edits in Excel instead. In fact, we encourage users to spend the time to do this to maximise the relevance of the data quality output reports. Here we show how you can upload a pre-defined test parameters file.
test_params_upload = pd.read_csv(resource_filename('DQMaRC', 'data/toydf_subset_test_params_24.05.16.csv'))
test_params_upload
Field | Date_Format | Completeness_NULL | Completeness_Empty | Completeness_Encoded | Completeness_Encoded_Mapping | Uniqueness_Rows | Consistency_Compare | Consistency_Compare_Field | Consistency_Compare_Mapping | ... | Validity_NHS_Number | Validity_Postcode_UK | Validity_Lookup_Table | Validity_Lookup_Type | Validity_Lookup_Codes | Validity_Range | Validity_Range_Numeric | Validity_Pattern | Validity_Pattern_Regex | Accuracy_Gold_Standard | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Patient_ID | NaN | True | True | False | NaN | True | False | NaN | NaN | ... | False | False | False | NaN | NaN | False | NaN | False | NaN | False |
1 | Gender | NaN | True | True | True | .|unknown|null|not known|na|n/a|none | False | False | NaN | NaN | ... | False | False | True | File | LU_toydf_gender.csv | False | NaN | False | NaN | False |
2 | Date_of_Diagnosis | %d/%m/%Y | True | True | False | NaN | False | False | NaN | NaN | ... | False | False | False | NaN | NaN | False | NaN | True | (\d{2})/(\d{2})/(\d{4}) | False |
3 | Date_of_Birth | %d/%m/%Y | True | True | False | NaN | False | False | NaN | NaN | ... | False | False | False | NaN | NaN | False | NaN | True | (\d{2})/(\d{2})/(\d{4}) | False |
4 | Age | NaN | True | True | False | NaN | False | False | NaN | NaN | ... | NaN | NaN | False | NaN | NaN | True | 0|110 | False | NaN | False |
5 | Weight_kg | NaN | True | True | False | NaN | False | False | NaN | NaN | ... | NaN | NaN | False | NaN | NaN | True | 5|150 | False | NaN | False |
6 | Height_cm | NaN | True | True | False | NaN | False | False | NaN | NaN | ... | NaN | NaN | False | NaN | NaN | True | 10|200 | False | NaN | False |
7 | BMI | NaN | True | True | False | NaN | False | False | NaN | NaN | ... | NaN | NaN | False | NaN | NaN | True | 5|60 | False | NaN | False |
8 | Postcode | NaN | True | True | True | .|unknown|null|not known|na|n/a|none | False | False | NaN | NaN | ... | False | True | False | NaN | NaN | False | NaN | False | NaN | False |
9 | ICD_10_Code | NaN | True | True | True | .|unknown|Unknown|null|not known|na|n/a|N/A|none | False | False | NaN | NaN | ... | False | False | False | NaN | NaN | False | NaN | False | NaN | False |
10 | Tumour_Stage | NaN | True | True | True | None|N/A|Incomplete Stage|Incomplete Stage2 | False | False | NaN | NaN | ... | False | False | True | File | LU_toydf_tumour_stage.csv | False | NaN | False | NaN | False |
11 | Metastatic_Indicator | NaN | True | True | False | NaN | False | True | Tumour_M_Stage | {"Absent": ["M0"], "Present": ["M1", "M1a", "M... | ... | False | False | False | NaN | NaN | False | NaN | False | NaN | False |
12 | Tumour_M_Stage | NaN | True | True | False | NaN | False | False | NaN | NaN | ... | False | False | True | Values | M0|M1|M1b|pM1 | False | NaN | False | NaN | False |
13 | Regimen_Start_Date | %d/%m/%Y | True | True | False | NaN | False | False | NaN | NaN | ... | False | False | False | NaN | NaN | False | NaN | True | [0-9]{2}/[0-9]{2}/[0-9]{2} | False |
14 | Datetime_Event1 | %d/%m/%Y %H:%M | True | True | False | NaN | False | False | NaN | NaN | ... | False | False | False | NaN | NaN | False | NaN | True | (\d{2})/(\d{2})/(\d{4}) (\d{2}):(\d{2}) | False |
15 | Datetime_Logging1 | %d/%m/%Y %H:%M | True | True | False | NaN | False | False | NaN | NaN | ... | False | False | False | NaN | NaN | False | NaN | True | (\d{2})/(\d{2})/(\d{4}) (\d{2}):(\d{2}) | False |
16 | Datetime_Event2 | %d/%m/%Y %H:%M | True | True | False | NaN | False | False | NaN | NaN | ... | False | False | False | NaN | NaN | False | NaN | True | [0-9]{2}/[0-9]{2}/[0-9]{4} [0-9]{2}:[0-9]{2} | False |
17 | Datetime_Logging2 | %d/%m/%Y %H:%M | True | True | False | NaN | False | False | NaN | NaN | ... | False | False | False | NaN | NaN | False | NaN | True | (\d{2})/(\d{2})/(\d{4}) (\d{2}):(\d{2}) | False |
18 rows × 30 columns
Set Test Parameters#
Now we are ready to set the test parameters. We are using the pre-defined, uploaded test parameters.
dq.set_test_params(test_params_upload)
Using uploaded test parameters
Run DQMaRC#
Once you have set your chosen test parameters, you run DQMaRC using the run_all_metrics() method. Users can optionally run each data quality dimension separately, as indicated in the commented code below.
dq.run_all_metrics()
# To run each dimension separately, use the following methods:
# dq.completeness.run_metrics()
# dq.uniqueness.run_metrics()
# dq.consistency.run_metrics()
# dq.timeliness.run_metrics()
# dq.validity.run_metrics()
# dq.accuracy.run_metrics()
Get DQ Results#
Now you can export your results. The two main DQ reports generated by DQMaRC include the full cell-level markup and the field-wise aggregated results.
Full Results#
The deepest, and potentially most useful, output is the data quality markup returned by DQMaRC.DataQuality.raw_results(). This is a cell-level binary markup of data quality errors, or flags. You can join the output of raw_results() to the original source dataset by the index.
raw = dq.raw_results()
raw
# The full results can be joined to the source data by the index.
# source_df_raw = df.join(raw)
Completeness_NULL_|_Patient_ID | Completeness_NULL_|_Gender | Completeness_NULL_|_Date_of_Diagnosis | Completeness_NULL_|_Date_of_Birth | Completeness_NULL_|_Age | Completeness_NULL_|_Weight_kg | Completeness_NULL_|_Height_cm | Completeness_NULL_|_BMI | Completeness_NULL_|_Postcode | Completeness_NULL_|_ICD_10_Code | ... | validity_count_|_Datetime_Event2 | validity_count_|_Datetime_Logging1 | validity_count_|_Datetime_Logging2 | validity_count_|_Gender | validity_count_|_Height_cm | validity_count_|_Postcode | validity_count_|_Regimen_Start_Date | validity_count_|_Tumour_M_Stage | validity_count_|_Tumour_Stage | validity_count_|_Weight_kg | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
2 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
4 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
104 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
105 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
106 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
107 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
108 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
109 rows × 114 columns
Aggregated Results#
You can also access a higher-level aggregation of the raw results. This contains the sum count of data quality errors detected for each test parameter and each source data field. To access this, simply use the DQMaRC.DataQuality.aggregate_results() method.
agg = dq.aggregate_results()
agg
metric | field | Completeness_NULL | Completeness_Empty | Completeness_Encoded | completeness_count | row_uniqueness | uniqueness_count | Consistency_Compare | Consistency_Date_Relations | consistency_count | Timeliness_Date_Diff | timeliness_count | Validity_Dates_Future | Validity_Date_Range | Validity_Postcode_UK | Validity_Lookup_Table | Validity_Range | Validity_Pattern | validity_count |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
12 | Patient_ID | 0.0 | 0.0 | NaN | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
8 | Gender | 11.0 | 0.0 | 7.0 | 18.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | NaN | NaN | 0.0 |
3 | Date_of_Diagnosis | 2.0 | 0.0 | NaN | 2.0 | NaN | NaN | NaN | 6.0 | 6.0 | NaN | NaN | 1.0 | 79.0 | NaN | NaN | NaN | 0.0 | 80.0 |
2 | Date_of_Birth | 1.0 | 0.0 | NaN | 1.0 | NaN | NaN | NaN | 6.0 | 6.0 | NaN | NaN | 1.0 | NaN | NaN | NaN | NaN | 0.0 | 1.0 |
0 | Age | 11.0 | 0.0 | NaN | 11.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 50.0 | NaN | 50.0 |
17 | Weight_kg | 11.0 | 0.0 | NaN | 11.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 84.0 | NaN | 84.0 |
9 | Height_cm | 10.0 | 0.0 | NaN | 10.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 34.0 | NaN | 34.0 |
1 | BMI | 9.0 | 0.0 | NaN | 9.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 36.0 | NaN | 36.0 |
13 | Postcode | 0.0 | 0.0 | 0.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2.0 | NaN | NaN | NaN | 2.0 |
10 | ICD_10_Code | 2.0 | 4.0 | 1.0 | 7.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
16 | Tumour_Stage | 11.0 | 0.0 | 10.0 | 21.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 28.0 | NaN | NaN | 28.0 |
11 | Metastatic_Indicator | 0.0 | 0.0 | NaN | 0.0 | NaN | NaN | 49.0 | NaN | 49.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
15 | Tumour_M_Stage | 0.0 | 0.0 | NaN | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 7.0 | NaN | NaN | 7.0 |
14 | Regimen_Start_Date | 1.0 | 0.0 | NaN | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2.0 | NaN | NaN | NaN | NaN | 0.0 | 2.0 |
4 | Datetime_Event1 | 0.0 | 0.0 | NaN | 0.0 | NaN | NaN | NaN | 9.0 | 9.0 | 19.0 | 19.0 | 8.0 | NaN | NaN | NaN | NaN | 9.0 | 17.0 |
6 | Datetime_Logging1 | 0.0 | 0.0 | NaN | 0.0 | NaN | NaN | NaN | 9.0 | 9.0 | NaN | NaN | 8.0 | NaN | NaN | NaN | NaN | 0.0 | 8.0 |
5 | Datetime_Event2 | 0.0 | 0.0 | NaN | 0.0 | NaN | NaN | NaN | 2.0 | 2.0 | 103.0 | 103.0 | 8.0 | NaN | NaN | NaN | NaN | 0.0 | 8.0 |
7 | Datetime_Logging2 | 0.0 | 0.0 | NaN | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 8.0 | NaN | NaN | NaN | NaN | 9.0 | 17.0 |
18 | full_row_uniqueness | NaN | NaN | NaN | NaN | 2.0 | 2.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Data Quality Visualisation#
Here we demonstrate examples of data visualisations to gain quick high-level overviews of the DQ results.
Overall data quality#
The overall quality is the average data quality for all metrics across all fields of the source dataset. The applied logic is:
- if average > 90, return "Outstanding"
- if average >= 80, return "Good"
- if average >= 60, return "Requires Improvement"
- if average < 60, return "Inadequate"
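As a rough sketch, the rating logic above could be expressed as follows (illustrative only; the packaged overall_quality_fx additionally returns display colours, as used further below):
def quality_label(average):
    # Map an average data quality percentage to the rating described above
    if average > 90:
        return "Outstanding"
    elif average >= 80:
        return "Good"
    elif average >= 60:
        return "Requires Improvement"
    else:
        return "Inadequate"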
# Prepare DQ Dashboard
raw_subset = raw.filter(regex='completeness|validity|consistency|uniqueness_count|accuracy|timeliness')
calculator = MetricCalculator(raw_subset)
# Run the calculation step to compute the aggregates
calculator.calculate_metrics()
summary_results = calculator.result
summary_results['Colour_Good'] = summary_results.apply(col_good, axis=1)
summary_results['Colour_Bad'] = summary_results.apply(col_bad, axis=1)
summary_results['Colour_NA'] = '#B2C3C6'
# Overall quality label
from IPython.display import HTML
# Function to display the overall quality label in a Jupyter Notebook
def display_overall_quality_label():
    if not summary_results.empty:
        data1 = summary_results[summary_results['Prop_NA'] == 0]
        avg_prop_good = data1['Prop_Good'].mean()
        overall_quality_level, background_color, text_colour = overall_quality_fx(avg_prop_good)
        overall_quality_text = f"Overall Quality: {overall_quality_level}"
        html = f"""
        <div style="
            background-color: {background_color};
            padding: 10px;
            border-radius: 5px;
            color: {text_colour};
            border: 2px solid {text_colour};
            text-align: center;
            width: 300px;">
            {overall_quality_text}
        </div>
        """
        return HTML(html)
    else:
        return HTML("<div style='text-align: center;'>No data available</div>")
# Use the function to display the result
display_overall_quality_label()
Donut Charts#
The donut charts represent the average DQ for all metrics and fields within a given DQ dimension.
DonutChartGenerator(summary_results).plot_donut_charts()
Bar Chart Completeness#
The barcharts represent the average DQ for each field of a given DQ dimension. Here we show all relevant fields’ average completeness quality.
BarPlotGenerator(summary_results, "completeness").plot_bar()
Bar Chart Consistency#
Here we show all relevant fields’ average consistency.
BarPlotGenerator(summary_results, "consistency").plot_bar()
Bar Chart Timeliness#
Here we show all relevant fields’ average timeliness.
BarPlotGenerator(summary_results, "timeliness").plot_bar()
Bar Chart Uniqueness#
Uniqueness indicates if records are unique across the combination of fields chosen.
BarPlotGenerator(summary_results, "uniqueness").plot_bar()
Bar Chart Validity#
Here we show all relevant fields’ average validity.
BarPlotGenerator(summary_results, "validity").plot_bar()
Example Errors by DQ Dimension#
First we must join the source data with the data quality markup, raw. We can do this using the pandas join method. The resulting df_DQ_full dataframe contains both the source data and the data quality markup.
# Join source data to the full DQ markup results
df_DQ_full = df.join(raw, how="left")
Completeness Examples#
In this example, we showcase completeness errors present in the Gender variable of the source dataset. Values such as ., unknown, and NaN are flagged as errors, represented as 1 in the adjacent fields Completeness_NULL_|_Gender, Completeness_Empty_|_Gender, and Completeness_Encoded_|_Gender. The completeness_count_|_Gender variable is the sum of these three DQ metrics for the variable in question (Gender in this case).
gender_completeness_conditions = (df_DQ_full['Completeness_NULL_|_Gender']>0) | \
(df_DQ_full['Completeness_Empty_|_Gender']>0) | \
(df_DQ_full['Completeness_Encoded_|_Gender']>0)
df_DQ_full[['Gender','Completeness_NULL_|_Gender',
'Completeness_Empty_|_Gender', 'Completeness_Encoded_|_Gender',
'completeness_count_|_Gender']].loc[(gender_completeness_conditions)]
Gender | Completeness_NULL_|_Gender | Completeness_Empty_|_Gender | Completeness_Encoded_|_Gender | completeness_count_|_Gender | |
---|---|---|---|---|---|
2 | . | 0 | 0 | 1 | 1 |
7 | . | 0 | 0 | 1 | 1 |
10 | NaN | 1 | 0 | 0 | 1 |
11 | NaN | 1 | 0 | 0 | 1 |
15 | . | 0 | 0 | 1 | 1 |
26 | . | 0 | 0 | 1 | 1 |
27 | NaN | 1 | 0 | 0 | 1 |
31 | unknown | 0 | 0 | 1 | 1 |
32 | unknown | 0 | 0 | 1 | 1 |
33 | unknown | 0 | 0 | 1 | 1 |
37 | NaN | 1 | 0 | 0 | 1 |
46 | NaN | 1 | 0 | 0 | 1 |
55 | NaN | 1 | 0 | 0 | 1 |
57 | NaN | 1 | 0 | 0 | 1 |
69 | NaN | 1 | 0 | 0 | 1 |
82 | NaN | 1 | 0 | 0 | 1 |
91 | NaN | 1 | 0 | 0 | 1 |
101 | NaN | 1 | 0 | 0 | 1 |
Uniqueness Examples#
Uniqueness evaluates duplicate records for variables where we expect them to be unique. Here you can see that the row_uniqueness_|_full_row_uniqueness variable indicates 1 if a duplicate record is present across the Patient_ID and Gender source variables.
# Check which rows have duplication where they should be unique
df_DQ_full[['Patient_ID', 'Gender']][df_DQ_full[['Patient_ID', 'Gender']].duplicated()]
# Show how the uniqueness flags appear
uniqueness_conditions = (df_DQ_full['Patient_ID']==1) | (df_DQ_full['Patient_ID']==10)
df_DQ_full[['Patient_ID', 'Gender',
'row_uniqueness_|_full_row_uniqueness']].loc[(uniqueness_conditions)]
Patient_ID | Gender | row_uniqueness_|_full_row_uniqueness | |
---|---|---|---|
0 | 1 | Male | 1 |
1 | 1 | Male | 0 |
10 | 10 | NaN | 1 |
11 | 10 | NaN | 0 |
Consistency Examples#
We demonstrate consistency errors below by comparing the source variables Metastatic_Indicator and Tumour_M_Stage. In the full DQ report, the variable Consistency_Compare_|_Metastatic_Indicator contains 1 if an inconsistency is detected between Metastatic_Indicator and Tumour_M_Stage, based on the mapping provided in the test_params data.
consistency_conditions_metastatic_indicator = (df_DQ_full['Consistency_Compare_|_Metastatic_Indicator']>0)
df_DQ_full[['Tumour_M_Stage','Metastatic_Indicator',
'Consistency_Compare_|_Metastatic_Indicator',
'consistency_count_|_Metastatic_Indicator']].loc[(consistency_conditions_metastatic_indicator)]
Tumour_M_Stage | Metastatic_Indicator | Consistency_Compare_|_Metastatic_Indicator | consistency_count_|_Metastatic_Indicator | |
---|---|---|---|---|
0 | M0 | Present | 1 | 1 |
1 | M0 | Present | 1 | 1 |
2 | M0 | Present | 1 | 1 |
3 | M0 | Present | 1 | 1 |
4 | M0 | Present | 1 | 1 |
5 | M0 | Present | 1 | 1 |
6 | M0 | Present | 1 | 1 |
7 | M0 | Present | 1 | 1 |
8 | M0 | Present | 1 | 1 |
9 | M0 | Present | 1 | 1 |
10 | M1 | Absent | 1 | 1 |
11 | M1 | Absent | 1 | 1 |
12 | M1 | Absent | 1 | 1 |
13 | M1 | Absent | 1 | 1 |
14 | M1a | Absent | 1 | 1 |
15 | pM1a | Absent | 1 | 1 |
16 | M1a | Absent | 1 | 1 |
17 | M1b | Absent | 1 | 1 |
18 | M1c | Absent | 1 | 1 |
19 | pM1a | Absent | 1 | 1 |
80 | M0 | Present | 1 | 1 |
81 | M0 | Present | 1 | 1 |
82 | M0 | Present | 1 | 1 |
83 | M0 | Present | 1 | 1 |
84 | M0 | Present | 1 | 1 |
85 | M0 | Present | 1 | 1 |
86 | M0 | Present | 1 | 1 |
87 | M0 | Present | 1 | 1 |
88 | M0 | Present | 1 | 1 |
89 | M0 | Present | 1 | 1 |
90 | M0 | Present | 1 | 1 |
91 | M0 | Present | 1 | 1 |
92 | M0 | Present | 1 | 1 |
93 | M0 | Present | 1 | 1 |
94 | M0 | Present | 1 | 1 |
95 | M0 | Present | 1 | 1 |
96 | M0 | Present | 1 | 1 |
97 | M0 | Present | 1 | 1 |
98 | M0 | Present | 1 | 1 |
99 | M0 | Present | 1 | 1 |
100 | M0 | Present | 1 | 1 |
101 | M0 | Present | 1 | 1 |
102 | M0 | Present | 1 | 1 |
103 | M0 | Present | 1 | 1 |
104 | M0 | Present | 1 | 1 |
105 | M0 | Present | 1 | 1 |
106 | M0 | Present | 1 | 1 |
107 | M0 | Present | 1 | 1 |
108 | M0 | Present | 1 | 1 |
We also demonstrate a consistency check between Date_of_Birth and Date_of_Diagnosis.
consistency_conditions_dates = (df_DQ_full['Consistency_Date_Relations_|_Date_of_Diagnosis']>0) | \
(df_DQ_full['Consistency_Date_Relations_|_Date_of_Birth']>0)
df_DQ_full[['Date_of_Birth','Date_of_Diagnosis',
'Consistency_Date_Relations_|_Date_of_Birth',
'Consistency_Date_Relations_|_Date_of_Diagnosis']].loc[(consistency_conditions_dates)]
Date_of_Birth | Date_of_Diagnosis | Consistency_Date_Relations_|_Date_of_Birth | Consistency_Date_Relations_|_Date_of_Diagnosis | |
---|---|---|---|---|
0 | 07/06/2090 | 23/04/2033 | 1 | 1 |
19 | 25/10/2015 | 12/07/2011 | 1 | 1 |
36 | 10/07/2017 | 31/05/2015 | 1 | 1 |
66 | 04/07/2014 | 26/01/2010 | 1 | 1 |
67 | 13/01/2020 | 07/12/2010 | 1 | 1 |
73 | 28/01/2015 | 01/02/2014 | 1 | 1 |
Timeliness Examples#
Here we demonstrate example timeliness errors by comparing the datetime difference between Datetime_Event2 and Datetime_Logging2. In the customised test parameters, we set a threshold of 10 minutes. In other words, if the difference in time between these variables is greater than 10 minutes, an error, i.e. 1, is recorded in the Timeliness_Date_Diff_|_Datetime_Event2 field of the DQ report.
timeliness_conditions = (df_DQ_full['Timeliness_Date_Diff_|_Datetime_Event2']>0)
df_DQ_full[['Datetime_Event2','Datetime_Logging2',
'Timeliness_Date_Diff_|_Datetime_Event2',
'timeliness_count_|_Datetime_Event2']].loc[(timeliness_conditions)]
Datetime_Event2 | Datetime_Logging2 | Timeliness_Date_Diff_|_Datetime_Event2 | timeliness_count_|_Datetime_Event2 | |
---|---|---|---|---|
0 | 16/03/2024 00:00 | 16/03/2024 00:53 | 1 | 1 |
1 | 19/03/2024 00:00 | 19/03/2024 01:11 | 1 | 1 |
2 | 04/03/2024 00:00 | 04/03/2024 00:18 | 1 | 1 |
3 | 30/03/2024 00:00 | 30/03/2024 01:28 | 1 | 1 |
4 | 29/03/2024 00:00 | 29/03/2024 00:47 | 1 | 1 |
... | ... | ... | ... | ... |
104 | 13/03/2024 00:00 | 01-01-1122 01:25:00 | 1 | 1 |
105 | 13/03/2024 00:00 | 01-01-1122 01:25:00 | 1 | 1 |
106 | 13/03/2024 00:00 | 01-01-1122 01:25:00 | 1 | 1 |
107 | 13/03/2024 00:00 | 01-01-1122 01:25:00 | 1 | 1 |
108 | 13/03/2024 00:00 | 01-01-1122 01:25:00 | 1 | 1 |
103 rows × 4 columns
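As a sanity check, the same 10-minute rule can be reproduced with plain pandas (a sketch, independent of DQMaRC's internals; note that rows whose datetimes fail to parse become NaT and are not flagged here, unlike in the DQMaRC markup above):
# Reproduce the 10-minute threshold rule directly on the source data
event = pd.to_datetime(df['Datetime_Event2'], format='%d/%m/%Y %H:%M', errors='coerce')
logged = pd.to_datetime(df['Datetime_Logging2'], format='%d/%m/%Y %H:%M', errors='coerce')
late = (logged - event) > pd.Timedelta(minutes=10)
print(late.sum())  # number of records breaching the threshold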
Validity Examples#
Here we demonstrate example errors of data validity across all 7 validity metrics: future dates, outlier dates, invalid NHS numbers, invalid UK postcodes, invalid codes, numerical outliers, and invalid patterns.
Future Dates#
Future dates are flagged when a datetime field contains a date in the future relative to the current datetime. This check cannot detect a date that was in the future when it was recorded but has since passed; such a record may instead be detected as an outlier date or an inconsistent date.
validity_future_dates_conditions = (df_DQ_full['Validity_Dates_Future_|_Date_of_Diagnosis']>0)
df_DQ_full[['Date_of_Diagnosis','Validity_Dates_Future_|_Date_of_Diagnosis']].loc[(validity_future_dates_conditions)]
Date_of_Diagnosis | Validity_Dates_Future_|_Date_of_Diagnosis | |
---|---|---|
0 | 23/04/2033 | 1 |
Outlier Dates#
Outlier dates are dates that occur outside the ranges as set in the test parameters
.
validity_outlier_dates_conditions = (df_DQ_full['Validity_Date_Range_|_Date_of_Diagnosis']>0)
df_DQ_full[['Date_of_Diagnosis','Validity_Date_Range_|_Date_of_Diagnosis']].loc[(validity_outlier_dates_conditions)]
Date_of_Diagnosis | Validity_Date_Range_|_Date_of_Diagnosis | |
---|---|---|
0 | 23/04/2033 | 1 |
4 | 17/09/2010 | 1 |
6 | 09/01/2013 | 1 |
7 | 19/06/2011 | 1 |
8 | 13/10/2012 | 1 |
... | ... | ... |
104 | 26/11/2010 | 1 |
105 | 30/07/2011 | 1 |
106 | 15/11/2014 | 1 |
107 | 15/03/2018 | 1 |
108 | 13/10/2017 | 1 |
79 rows × 2 columns
Invalid NHS Numbers#
Invalid NHS numbers are flagged when NHS numbers do not meet the requirements of the NHS validation algorithm. Here is how you would check which NHS numbers in your dataset are invalid. We excluded this from our example to minimise the risk of sharing potentially real NHS numbers, even when synthetically generated.
#validity_NHS_number_conditions = (df_DQ_full['Validity_NHS_Number_|_NHS_number']>0)
#df_DQ_full[['NHS_number','Validity_NHS_Number_|_NHS_number']].loc[(validity_NHS_number_conditions)]
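For reference, the standard NHS number check is a Modulus 11 checksum. The sketch below illustrates that public algorithm; it is independent of DQMaRC's own validate_nhs_number() implementation:
def nhs_number_is_valid(nhs_number: str) -> bool:
    digits = [int(c) for c in nhs_number if c.isdigit()]
    if len(digits) != 10:
        return False
    # Weight the first nine digits from 10 down to 2 and sum the products
    total = sum(d * w for d, w in zip(digits[:9], range(10, 1, -1)))
    check = 11 - (total % 11)
    if check == 11:
        check = 0
    if check == 10:
        # 10 can never be a valid check digit, so the number is invalid
        return False
    return check == digits[9]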
Invalid UK Postcodes#
Invalid UK postcodes can be detected by the UK postcode validation algorithm. Other countries’ postcodes can be validated using pattern validation as shown below.
validity_UK_postcodes_conditions = (df_DQ_full['Validity_Postcode_UK_|_Postcode']>0)
df_DQ_full[['Postcode','Validity_Postcode_UK_|_Postcode']].loc[(validity_UK_postcodes_conditions)]
Postcode | Validity_Postcode_UK_|_Postcode | |
---|---|---|
6 | AAA 1AA | 1 |
8 | BB1 1ABB | 1 |
Invalid Tumour Stage Codes#
This is one example of a validation test applied to a categorical or coded source variable. In this case, we validate Tumour_Stage codes by comparing them against a list of valid codes. We must provide the code list as a csv file and give its name in the test parameters dataset.
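If you need to build such a code list yourself, writing it to csv can be as simple as the sketch below (the column layout DQMaRC expects is an assumption here; check a bundled lookup such as LU_toydf_gender.csv for the actual format):
# Hypothetical code list; the 'code' column name is an assumption
valid_stages = pd.DataFrame({'code': ['Stage 1', 'Stage 2', 'Stage 3', 'Stage 4']})
valid_stages.to_csv('LU_my_tumour_stage.csv', index=False)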
validity_codes_conditions = (df_DQ_full['Validity_Lookup_Table_|_Tumour_Stage']>0)
df_DQ_full[['Tumour_Stage','Validity_Lookup_Table_|_Tumour_Stage']].loc[(validity_codes_conditions)]
Tumour_Stage | Validity_Lookup_Table_|_Tumour_Stage | |
---|---|---|
23 | Stage3 | 1 |
27 | Stage3 | 1 |
30 | Stage3 | 1 |
31 | stage4 | 1 |
34 | Stage3 | 1 |
38 | stage 4 | 1 |
42 | Stage3 | 1 |
45 | Stage3 | 1 |
46 | stage4 | 1 |
49 | Stage3 | 1 |
53 | stage 4 | 1 |
57 | Stage3 | 1 |
60 | Stage3 | 1 |
61 | stage4 | 1 |
64 | Stage3 | 1 |
71 | Stage3 | 1 |
74 | Stage3 | 1 |
75 | stage4 | 1 |
78 | Stage3 | 1 |
82 | stage 4 | 1 |
86 | Stage3 | 1 |
90 | Stage3 | 1 |
93 | Stage3 | 1 |
94 | stage4 | 1 |
97 | Stage3 | 1 |
101 | stage 4 | 1 |
105 | Stage3 | 1 |
108 | Stage3 | 1 |
Numerical Outliers#
Numerical outliers are detected by applying a minimum and maximum numerical range in the test parameters.
validity_numerical_ranges_conditions = (df_DQ_full['Validity_Range_|_Height_cm']>0)
df_DQ_full[['Height_cm','Validity_Range_|_Height_cm']].loc[(validity_numerical_ranges_conditions)]
Height_cm | Validity_Range_|_Height_cm | |
---|---|---|
0 | 220.50 | 1 |
4 | 7.48 | 1 |
8 | 296.97 | 1 |
10 | 210.99 | 1 |
12 | 289.45 | 1 |
16 | 202.02 | 1 |
17 | 227.60 | 1 |
19 | 248.42 | 1 |
21 | 297.19 | 1 |
23 | 249.22 | 1 |
25 | 275.20 | 1 |
26 | 265.48 | 1 |
30 | 217.05 | 1 |
32 | 252.69 | 1 |
33 | 281.95 | 1 |
35 | 283.26 | 1 |
39 | 230.48 | 1 |
44 | 247.59 | 1 |
47 | 253.83 | 1 |
49 | 217.83 | 1 |
52 | 211.40 | 1 |
57 | 240.35 | 1 |
62 | 246.27 | 1 |
71 | 210.46 | 1 |
72 | 255.92 | 1 |
74 | 292.97 | 1 |
79 | 6.35 | 1 |
80 | 256.97 | 1 |
82 | 209.31 | 1 |
84 | 299.91 | 1 |
85 | 231.40 | 1 |
92 | 226.65 | 1 |
99 | 226.65 | 1 |
106 | 226.65 | 1 |
Invalid Patterns#
The pattern validation check uses a regular expression pattern. Below, you can see a datetime variable with invalid date formats or patterns. The test parameters are set to only accept patterns that follow DD/MM/YYYY HH:MM.
validity_pattern_conditions = (df_DQ_full['Validity_Pattern_|_Datetime_Event1']>0)
df_DQ_full[['Datetime_Event1','Validity_Pattern_|_Datetime_Event1']].loc[(validity_pattern_conditions)]
Datetime_Event1 | Validity_Pattern_|_Datetime_Event1 | |
---|---|---|
100 | 1991/03/2024 00:00:00 | 1 |
101 | 1991/03/2024 00:00:00 | 1 |
102 | 1991/03/2024 00:00:00 | 1 |
103 | 1991/03/2024 00:00:00 | 1 |
104 | 1991/03/2024 00:00:00 | 1 |
105 | 1991/03/2024 00:00:00 | 1 |
106 | 1991/03/2024 00:00:00 | 1 |
107 | 1991/03/2024 00:00:00 | 1 |
108 | 1991/03/2024 00:00:00 | 1 |